knitr::opts_chunk$set(warning=FALSE, message=FALSE)

1 Welcome

data gathered from https://data.humdata.org/dataset/novel-coronavirus-2019-ncov-cases

2 Cleaning data

Here I am doing a few things:

options(stringsAsFactors=FALSE)
library(plyr)
library(dplyr)
library(reshape2)
library(ggplot2)
library(plotly)
library(highcharter)

#function for grabbing status from my filenames
getStatus<- function(x){
  strsplit(x, "-")[[1]] %>%
    last() %>%
    gsub(pattern="\\.csv", replacement="")
}

#function created for adding an active status
createActive <- function(x){
  dcast(Country.Region + Date ~ Status,
        data=x, value.var="Value") %>%
    mutate(Active = Confirmed - (Deaths + Recovered)) %>% 
    melt(id = c("Country.Region", "Date"))
}

#function used to convert a dataset from long to wide
convert <- function(x){ dcast(Country + Date ~ Status,
        data=x, value.var="Value")}

###########################################
### This is where I start cleaning the data
files <- list.files("raw", full.names = TRUE)

raw <- files %>% #read in files
  lapply(function(x){
  read.csv(x) %>% 
    mutate(
      Date = as.Date(Date, "%m/%d/%Y"),
      Status = getStatus(x) #add status
    )
}) %>% 
  bind_rows() %>% #combine files
  subset(!(Value==0) )

raw <- raw %>% #combine countries into one
  group_by(Country.Region, Date, Status) %>% 
  summarise(
    Value = sum(Value)
  )

raw <- raw %>% #get global stats
  group_by(Date, Status) %>% 
  summarise(
    Value = sum(Value)
  ) %>% 
  ungroup() %>% 
  mutate(
    Country.Region = "Global"
  ) %>% 
  bind_rows(raw)

raw <- raw %>% #create active columns, delete nulls, rename
  createActive() %>% 
  subset(!is.na(value)) %>% 
  rename(
   Country = Country.Region,
    Value = value,
    Status = variable
  )

total <- raw %>% #Get total values for seperate df
  group_by(Country, Status) %>% 
  summarise(
    Value = max(Value)
  ) %>% 
  ungroup() %>% 
  mutate(
    Status = as.character(Status)
  )

#get change per day
raw <- raw %>%
  group_by(Country, Status) %>% 
  mutate(
    Change =  Value - lag(Value, k=1),
    Rate_Change =  (Value - lag(Value, k=1))/lag(Value, k=1) 
    ) %>% 
  ungroup() %>% 
  mutate(
    Status = as.character(Status)
  )

Now that the data is clean, lets start digging into the data!

3 A Quick Analysis

3.1 Cumulative Total Cases

The above graph shows the variance of cumulative cases by type, and by country. We can see some pretty interesting things here. For example, Italy has the highest deaths and active cases, yet they are only second in the number of confirmed cases. This could mean a few things:

  • Italy may not have been testing enough people, which would bring their Mortality Rate up.

  • Italys’ healthcare system is getting overrun, and their deaths are about to skyrocket.

This graph perfectly shows just how many Deaths Italy has compared to our top 5 countries. At the time of writing this, Italy has just about doubled their death count over Chinas’!

For the remainder of this analysis, I will be focusing on Italy.

3.2 Rates of Change

Theres a couple things to keep in mind when looking at the rate of change.

  • It has more variance in the beginning.

  • As the number of cases get larger, the rate will tend to even out, or even decline.

This is the basic Exponential growth function:

\(f(x) = P_0(1+r)^d\)

The rate is what gets exponentiated for an exponential function, ie, if our mean rate is higher, our confirmed cases will get exponentially higher with each day. Also, if the rate is negative that means our cases per day has dropped from the previous days’ number. If there starts to be a trend of negative rates, this means that the disease is starting to slow down, which could mean we are past the peak.

This graph clearly shows that the deaths rate of change has not only a higher average, but most of the rates are higher than the rate of confirmed. This tells us the number of deaths are going up much faster than our confirmed cases are.

There are a multitude of reasons why this could happen:

  • The Confirmed cases are lagging behind, due to the incubation period.

  • The Confirmed cases are innacurate, as finding this true value is difficult.

  • Our Mortality rate is not static

    • This is a likely scenario for a few reasons.
      • Healthcare systems can be overrun, causing more deaths.
      • Population ages vary depending on location, and we know Covid-19 has a higher Mortality rate among an older population.
  • Rates tend to even out over time, meaning in the beginning stages the rate of change will be much higher than later on.

    • If we have 1 person infected today, and another tomorrow, the rate of change from day 1 to day 2 is 2.0

3.3 Deeper look at Italys’ Death Rate of Change

Looking at this scatter plot, as the number of Deceased cases rise, the Rate of Change trends downwards. This means that Italy has a negative correlation between the actual change in cases per day and the rate of change. This is a good thing, as it could mean a few things:

  • The infection rate is starting to slow down.
  • The rate of change is starting to average out to a lower number.

Lets find out how well correlated the two actually are, using pearsons’ method.

Before we start, I will get the outliers out of the data, as some of the early rates are innacurate, for the reasons described before.

First, lets check to see if our Rate of Change is relatively normal.

## 
##  Shapiro-Wilk normality test
## 
## data:  std_corr$Rate_Change
## W = 0.94448, p-value = 0.1314

This tells us that our p value is 0.1314 > 0.05

This is great news! It means our data is normal, and this means that we can use our Rate of Deaths as a predictor for our model!

3.4 Finding the significance of Italys’ Rate of Deceased Cases

As expected there is a negative correlation between the Rate of Change and Total cases, aswell as a negative correlation between our Rate of Change and the cases per day.

This makes sense as the amount of cases increase, our rate should start having less variations. This is also great news, as it means our rate is trending downwards, ultimately suggesting that the rate of Deaths per day is slowing down.

The r squared value can tell us much more, for example, we know that 88.7% of our Cases per day can be explained by our Total cases! This also tells us that 29.4% of our Cumulative cases can be explained by the variations in the Rate of Change.

Let’s check if these values are statistically significant, to do that we need the below formula to get our test statistic.

\({ts} = \frac{r*\sqrt[2]{n-2}}{\sqrt[2]{1-r^2}}\)

##                 Value    Change Rate_Change
## Value             Inf 14.245784    3.355033
## Change      14.245784       Inf    2.206968
## Rate_Change  3.355033  2.206968         Inf

These are our test statistics for each r value, now we need to use the below formula to check if the r value is statistically significant.

\({-ts} > r > ts\)

Here is a matrix of our r values.

##                  Value     Change Rate_Change
## Value        1.0000000  0.9394569  -0.5424324
## Change       0.9394569  1.0000000  -0.3909310
## Rate_Change -0.5424324 -0.3909310   1.0000000

This tells us that our all our correlations are statistically significant! Which means that we can use all of our factors as a predictor of the other.

I will be using the correlation between Italys’ Deaths rate of change to predict our Deaths per day values, as this is a statistically significant correlation, and the math lines up.

4 Building a model of Italys’ Deaths per day

4.1 Facts before the build

There are two things I want to point out about Covid-19:

  • We know the median incubation period is 5 days.
    • This means that there is most likely a 5 day lag in our confirmed cases, as 50% of infected individuals will not experience any symptoms before the incubation period.
  • We know the average days from illness onset to death is 18 days.
    • This can help us estimate the amount of infections 23 days prior.

Using these two facts, we can estimate the average time from point of infection to death is 23 days (5 incubation days + 18 days till death). This means that if someone dies today from Covid-19 we can estimate that they got infected 23 days prior.

sources:

https://www.uptodate.com/contents/coronavirus-disease-2019-covid-19?source=history_widget

https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(20)30566-3/fulltext

I want to first build a model of Italys’ predicted Deaths per day, this can further help us find the actual number of confirmed cases!

4.2 Building the predicted Deaths per day

There are a few things I want to note:

  • The reported Deaths are likely to be the most accurate.
    • This is due to the fact that there are likely much more confirmed cases of Covid-19 than what is reported.
      • Covid-19 is known to be transmittable by people without showing symptoms.
      • Not everyone will be tested.
  • Using the Deaths rate of change, we can estimate the future Deaths to be reported.

So lets get started!

4.2.1 The math behind using Deaths

\({Deaths}={Infected}*{MortalityRate}\)

This is our formula to find the number of Deaths from any given population of infected individuals. We are focusing on the Deaths per day, as I want a per day model.

This means we need to take the derivative of our formula, with respect to time. This leads into a two options:

  • The Mortality Rate is a constant value for the population.

  • The mortality Rate is a value that also changes with time.

\(D=Deaths\)

\(I=Infected\)

\(R=MortalityRate\)

If we take the derivative with Mortality Rate as a variable we get the following:

  • \(\frac{dD}{dt}=\frac{dI}{dt}*\frac{dD}{dI}+\frac{dR}{dt}*\frac{dD}{dR}\)

  • \(\frac{dD}{dt}=\frac{dI}{dt}*R+\frac{dR}{dt}*I\)

In plain terms, the change in Deaths is equal to the change in infections, time the Mortality Rate, plus, the change in Mortality Rate, times the infections.

There is one major problem with this, we cannot make an accurate estimate on the Mortality Rate, which means we also cannot make an accurate estimate on the change in Mortality Rate. So, while this would be the better option, I do not believe it is feasable with the data we have now, as we cannot get an accurate Mortality Rate.

However, taking the derivative with Mortality Rate as a constant gives us the following:

  • \(\frac{dD}{dt}=\frac{dI}{dt}*R\)

which also tells us..

  • \(\frac{dI}{dt}=\frac{\frac{dD}{dt}}{R}\)

In plain terms, we get that the change in Deaths is equal to the change in the infections, times the Mortality Rate. We also know that the change in infections is equal to the change in Deaths, divided by the Mortality Rate.

This means that with a constant Mortality Rate, the change in Deaths will also be the same as the change in Infections! As the rate is just scaling the numbers. This tells us we can make predictions on the change in Infections by using the change in Deaths, scaling with the Mortality Rate afterwards!

Using this information, I will build my model with Mortality Rate as a constant.

4.3 The math behind the Monte Carlo Simulation

\(f(t)=P_0*(1+r)^t\)

We are calculating this one day at a time, which means our t value will be 1, giving us this:

\(f(1)=P_0*(1+r)\)

\(\frac{f(1)}{P_0}=1+r\)

\(r=\frac{f(1)}{P_0}-1\)

\(r=\frac{\Delta f(1)}{P_0}\)

\(s.t.f(0)=P_0\)

\(r=\frac{f(1)-f(0)}{f(0)}\)

FIXING

\(r_1-r_0=\frac{f(1)-f(0)}{P_0}\)

\(r_1=\frac{f(1)-f(0)}{P_0}+r_0\)

\(f(1)=(r_0*P_0)-(r_1*P_0)+P_0._\square\)

In plain terms, the day to predict is equal to the change in our rate, times the previous day, plus the previous day. This means, if we try to predict the change in our rate, we can find our predicted Cumulative Deaths!

4.4 The Algorithm

To start, there are a few things to note about how I built this algorithm:

  • I will build it using the mean and standard deviation from the past 8 days.
    • What this means is, we will only look at the last 8 rates of change to base our next day off of.
    • Once we predict a future rate, we throw away the oldest rate and re-model our predictions!
      • This is to make sure that our model changes in accordance with our predictions, otherwise we would start to see a more linear aspect to our predictions.
  • I have split this algorithm into multiple functions to stop at certain points, in order to explain whats going on.

Lets dive in by taking a look at how our rates of change work!

5 Prediction Analysis and building

To start, we know how much our Deaths are changing per day, we can formulate what percentage this change is from our cumulative Deaths. To do this, we need the following formula:

\(r=\frac{\Delta f(t)}{P_0}\)

In plain terms, this formula takes the change per day of our deaths, and divides that by our previous days cumulative Deaths! Making the deaths per day into a percentage of our cumulative deaths!

##          Date Deaths Deaths_PD Deaths_RC
## 27 2020-03-19   3405       427 0.1433848
## 28 2020-03-20   4032       627 0.1841410
## 29 2020-03-21   4825       793 0.1966766
## 30 2020-03-22   5476       651 0.1349223
## 31 2020-03-23   6077       601 0.1097516

Now that this step is complete, we can calculate the change in our death rate of change. Essentially looking at how much our rates are fluctiating over time. We can do this simply by taking our first deaths rate of change and subtracting the second days rate of change from that. This will mean if the rate goes up, the change will go up as well!

##          Date Deaths Deaths_PD Deaths_RC      D_RC_C
## 27 2020-03-19   3405       427 0.1433848 -0.04638745
## 28 2020-03-20   4032       627 0.1841410  0.04075615
## 29 2020-03-21   4825       793 0.1966766  0.01253562
## 30 2020-03-22   5476       651 0.1349223 -0.06175431
## 31 2020-03-23   6077       601 0.1097516 -0.02517064

Great, here are Italys’ cumulative Deaths, Deaths per day, Deaths rate of change, and the change in the rate of change. We will use all this information to work backwards at the end, to fill in the predictions.

This is the distribution of the rate of change, we want to take a look at this to make sure it is relatively normal, otherwise there will be a different method needed to make predictions from.

## 
##  Shapiro-Wilk normality test
## 
## data:  italy$D_RC_C
## W = 0.89634, p-value = 0.2677

Exactly what we wanted to see, our p value is > 0.05, meaning our distribution is normal enough to find the probabilities!

5.1 Building the predictions

Now that we have the changes per day, we can build the prediction model. I will break this down into steps:

  • Use a changing mean and standard deviation:

    • Think of this as a revolving door, we only want to look at the past 8 days’ changes.
  • Generate a random number for each day, per trial

  • Calculate backwards to get the Cumulative Deaths

### functions used inside function below
get_rc <- function(death_rc_n1, change){
  #rc = lag death_rc + change
  rc=death_rc_n1+change
  return(rc)
}

get_pd <- function(deaths_n1, deaths_rc){
  #lag death*roc = pd
  
  pd = (deaths_rc*deaths_n1)
  if (pd <0){
    pd = pd * -1
  }
  
  return(ceiling(pd))
}

get_death <- function(lag_death = lag_death, dpd){
  #lag death + death per dat = next cumulative death
  if(dpd <0){
    dpd <- dpd * -1
  }
  
  death = lag_death+dpd
  return(ceiling(death))
}

### function used for prediction model
prediction_model <- function(data, days, trials, split, curr_d){
  curr_d = max(data[data$D_RC_C != 0,"Date"])
  
  for (n in 1:trials){
    temp <- data %>% 
      subset(select = c(Date, D_RC_C))

    for (i in 1:days){
      
      mu <- temp %>% #grabbing mean
        subset(select = D_RC_C) %>% 
        slice(nrow(temp)-(nrow(temp)-split):nrow(temp)+i) %>% 
        as.matrix() %>% 
        mean()
      
      sd <- temp %>% #grabbing standard deviation
        subset(select = D_RC_C) %>% 
        slice(nrow(temp)-(nrow(temp)-split):nrow(temp)+i) %>% 
        as.matrix() %>% 
        sd()
      
      set.seed(i*13*n)
      temp <- temp %>% 
        bind_rows(data.frame(Date = as.Date(curr_d, format = "%Y-%m-%d")+i, D_RC_C = rnorm(1, mu, sd))) %>% 
        mutate(
          Trial = n
        )
    }
    
    if (n == 1){
      trial_data <- temp
    }
    else{
      trial_data <- temp %>% 
        bind_rows(trial_data)
    }
  }
  
  return (trial_data)
}
curr_d <- max(italy[italy$D_RC_C != 0,"Date"])

TRIALS <- 20
DAYS <- 10
SPLIT <- 8

italy <- italy %>% 
  prediction_model(days = DAYS, trials = TRIALS, split = SPLIT, curr_d) %>% 
  group_by(Trial, Date) %>% 
  slice(n()) %>% 
  ungroup() %>% 
  mutate(
    Deaths = NA,
    Deaths_PD = NA,
    Deaths_RC = NA,
    Country = "Italy"
  )

5.2 Taking a look at our predicted Deaths

As shown, these are the 20 simulations for a 10 day forecast. Lets take a deeper look.

Here we can see the clustering of our Simulations. It seems the farther the forecast, the less clustered it becomes. This means that the Variance is growing as we predict the days, something that is expected as this is using a Markov Technique.

Lets take a look at the distribution of these indivdual days!

This shows what exactly is going on with each day that the algorithm predicts. As the number of predicted days increase, the distribution of our Cumulative Deaths flatten out. This means that there is more variation, or ‘less accuracy’, with each day that the algorithm predicts. This tells us that it will be more adventageous to use this algorithm as a short term predictor, rather than a long term predictor.

Now lets take a look our individual rates!

Now we can see that there is a clear split between rates that are going downwards and rates that are climbing upwards! This is great news, as it can generally imply that the Rate of Change will start to climb down, suggesting that what has been going on in the past 8 days is directly effecting the rate at which Covid-19 is spreading!